Predicting Project Success for New York City Capital Projects

CS109b: Module E

Project Team 71

Authors:

  1. An Hoang
  2. Mark McDonald
  3. Michael Sedelmeyer

Introduction

i. About this notebook

ADD BRIEF DESCRIPTION OF WHAT THIS NOTEBOOK IS

(SAMPLE) This notebook contains a set of baseline linear models created to predict Budget_Change_Ratio and Schedule_Change_Ratio using the small set of valid predictors available in our original dataset (therefore, no engineered features were used here).

  • For an overview of the analytical sections of this notebook, please see the Notebook Contents index listed below this introduction.

i.i. A note about supporting notebooks

The code and results given in this notebook are only a summary of the work completed toward this analysis. Supplemental notebooks containing the complete EDA, data cleansing, feature engineering, and model exploration illustrated in this report can be found in the notebooks/ directory of the supporting GitHub project repository. The notebooks in that repository are designed to be run in sequential numbered order to reproduce the results shown here. In addition, throughout this report, we will provide links to specific supporting notebooks for further reference.

i.ii. A note about supporting custom python modules

As you will see in the imports section of this notebook, a number of custom python modules are used to generate the results shown throughout this notebook. The code for these modules is stored separately in the src/ directory of the supporting GitHub repository for better source code version control, to reproduce output consistently across multiple notebooks, and to keep this report to a manageable length.

ii. Research Question

Given the set of New York City Capital Projects change data, can we create a model that accurately predicts the 3-year change in forecasted project budget and the 3-year change in forecasted project duration, using only the data available at the start of a project as our predictors?

In other words, using historical project data, can we predict how much the forecasted budget and duration of any given capital project run by the City of New York will deviate from its original budgeted estimates by the end of the project's third year?
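As a concrete illustration of the two response variables described above (the column names mirror the dataset's; the values here are made up, and the exact construction used in the supporting notebooks is not reproduced):

```python
import pandas as pd

# Hypothetical projects: original forecasts vs. forecasts at end of year 3
projects = pd.DataFrame({
    "Budget_Start": [1_000_000.0, 250_000.0],   # original budget forecast ($)
    "Budget_End": [1_400_000.0, 240_000.0],     # year-3 budget forecast ($)
    "Duration_Start": [730, 365],               # original duration forecast (days)
    "Duration_End": [1095, 365],                # year-3 duration forecast (days)
})

# Relative deviation from the original estimates -- the two targets the
# models in this report attempt to predict
projects["Budget_Change_Ratio"] = (
    projects["Budget_End"] - projects["Budget_Start"]
) / projects["Budget_Start"]
projects["Schedule_Change_Ratio"] = (
    projects["Duration_End"] - projects["Duration_Start"]
) / projects["Duration_Start"]

print(projects[["Budget_Change_Ratio", "Schedule_Change_Ratio"]])
```

A ratio of 0.4, for instance, means the year-3 budget forecast exceeds the original estimate by 40%.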

The significance of a model that can accurately address this question is that, given any new project, project managers and city administrators would have another tool at their disposal for objectively identifying potential budget and schedule risk at the start of a new city-run capital project. Such a tool can help to overcome common planning fallacies and ..... (TO ADD ADDITIONAL THEORETICAL TERMS FROM ACADEMIC PAPERS)

iii. Summary of findings

ADD AN EXECUTIVE SUMMARY OF OUR FINDINGS HERE (5-6 key points/findings)

SEE THE BASELINE LINEAR MODELS NOTEBOOK FOR AN EXAMPLE

  • Sigmoid-scaled Budget_Start and Duration_Start predictor data sufficiently minimizes skew, providing the best predictive performance among our linear models.
  • Our two predicted outcome variables, the 3-year project Budget_Change_Ratio and Schedule_Change_Ratio, exhibit markedly different predictive behavior.
  • While our linear models fail to predict Budget_Change_Ratio better than a naive model, as evidenced by a negative $R^2$ test score, our Schedule_Change_Ratio predictions do moderately well, with our best test $R^2$ score exceeding 0.54.
  • On the other hand, our smoothing-spline GAM model, with separately optimized $\lambda$ term penalties for each predicted outcome variable (Budget_Change_Ratio and Schedule_Change_Ratio), demonstrated enough improvement in our $R^2$ scores to offer hope that a combination of additional engineered features and a sufficiently expressive model might give us an acceptable level of predictive performance.
"BASELINE+" Smoothing Spline GAM results with optimized term penalties by response variable and a label-encoded `Category` predictor:

    ['Budget_Start', 'Duration_Start', 'Category_Code']


    R-squared scores:

        Budget_Change_Ratio

            Training    0.8263
            Test        -0.8736

        Schedule_Change_Ratio

            Training    0.5800
            Test        0.5373
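The `sigmoid` scaling noted in the findings above lives in `src.scale` and is not reproduced in this report; a minimal stand-in illustrating the idea (standardize the skewed predictor, then squash it through the logistic function to pull in the long right tail) might look like:

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid
from scipy.stats import skew

rng = np.random.default_rng(0)
# Heavily right-skewed stand-in for raw Budget_Start values (dollars)
budget_start = rng.lognormal(mean=16, sigma=1.5, size=500)

# Standardize first so the sigmoid's active range covers the data,
# then squash into (0, 1)
z = (budget_start - budget_start.mean()) / budget_start.std()
budget_sigmoid = expit(z)

print(f"raw skew: {skew(budget_start):.2f}, "
      f"sigmoid-scaled skew: {skew(budget_sigmoid):.2f}")
```

The transformed values are bounded in (0, 1) and far less skewed, which is what makes them better-behaved predictors for the linear models.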

Imported Modules and Data

In [3]:
from IPython.display import HTML, Image, IFrame, Markdown

HTML(
    '''<script> code_show=true; function code_toggle() {
    if (code_show){
    $('div.input').show();
    } else {
    $('div.input').hide();
    }
    code_show = !code_show
    } 
    $( document ).ready(code_toggle);
    </script>
    <form action="javascript:code_toggle()">
    <input type="submit" value="Click here to toggle on/off the raw code.">
    </form>'''
)
Out[3]:
In [4]:
import os
import sys

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# general regressor, scaling, and scoring sklearn imports
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, \
                                  RobustScaler, LabelEncoder

# clustering specific imports
from sklearn.cluster import KMeans
from gap_statistic import OptimalK
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_samples, silhouette_score

# statsmodels and pygam imports
import statsmodels.formula.api as sm
from pygam import LinearGAM, s, f

# import custom .py functions from src/ directory 
sys.path.append('..')
from src.datagen import print_interval_dict
from src.scale import scale_features, sigmoid, log_plus_one, encode_categories
from src.model import generate_model_dict, print_model_results
from src.visualize import plot_true_pred, plot_bdgt_sched_scaled, \
                          plot_change_trend, plot_gam_by_predictor, \
                          plot_line, plot_value_counts
from src.cluster import silplot, display_gapstat_with_errbars, \
                        fit_neighbors, plot_epsilon, silscore_dbscan, \
                        fit_dbscan, print_dbscan_results


# Avoid scientific notation output in Pandas
# pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.options.display.float_format = '{:,.2f}'.format

# Improve resolution of output graphics
%config InlineBackend.figure_format ='retina'
In [5]:
# establish file paths and check to ensure target files exist
filepath_train = '../data/processed/NYC_capital_projects_3yr_final_train.csv'
filepath_test = '../data/processed/NYC_capital_projects_3yr_final_test.csv'
filepath_full = '../data/interim/Capital_Projects_clean.csv'
filepath_allyears = '../data/interim/NYC_capital_projects_all.csv' 

error = None
for filepath in [filepath_train, filepath_test, filepath_full, filepath_allyears]: 
    if not os.path.isfile(filepath):
        error = 1
        print(
            "ERROR - the following target file does not exist:\n\n\t{}\n"\
            "".format(filepath)
        )

if error is None:
    print("OK - all filepaths point to existing files!")
OK - all filepaths point to existing files!
In [6]:
# load dataframes from target files and set datetime columns
data_train = pd.read_csv(filepath_train)
data_test = pd.read_csv(filepath_test)
data_full = pd.read_csv(filepath_full)
data_allyears = pd.read_csv(filepath_allyears)

datetime_cols = [
    'Design_Start',
    'Final_Change_Date',
    'Schedule_Start',
    'Schedule_End',
]

for col in datetime_cols:
    data_train[col] = pd.to_datetime(data_train[col])
    data_test[col] = pd.to_datetime(data_test[col])
    data_allyears[col] = pd.to_datetime(data_allyears[col])
In [7]:
def print_record_project_count(dataframe, dataset='full'):
    """Prints summary of records and unique projects in dataframe
    """
    if dataset=='full':
        print(
            'For the ORIGINAL cleansed data, containing all available NYC capital '\
            'projects change records:\n'
        )

    elif dataset=='all':
        print(
            'For the data containing start and end data for all available '\
            'NYC capital projects for the ENTIRE INTERVAL of changes '\
            'covered in the ORIGINAL data:\n'
        )
        
    else:
        print(
            'For the final {} data, containing the {} split of 3-year '\
            'project data used in this analysis:\n'.format(
                dataset.upper(), dataset
            )
        )    
    
    # entries
    print(f"\tNumber of dataset records: {len(dataframe)}")

    # num projects
    print(
        f"\tNumber of unique projects in dataset: {dataframe['PID'].nunique()}\n"
    )
    

print_record_project_count(data_full, 'full')
print_record_project_count(data_allyears, 'all')
print_record_project_count(data_train, 'training')
print_record_project_count(data_test, 'test')
For the ORIGINAL cleansed data, containing all available NYC capital projects change records:

	Number of dataset records: 2095
	Number of unique projects in dataset: 355

For the data containing start and end data for all available NYC capital projects for the ENTIRE INTERVAL of changes covered in the ORIGINAL data:

	Number of dataset records: 355
	Number of unique projects in dataset: 355

For the final TRAINING data, containing the training split of 3-year project data used in this analysis:

	Number of dataset records: 134
	Number of unique projects in dataset: 134

For the final TEST data, containing the test split of 3-year project data used in this analysis:

	Number of dataset records: 15
	Number of unique projects in dataset: 15

In [8]:
print(
    'The shapes of our loaded train-test splits are:\n\n'\
    '\tTrain\t{}\n\tTest\t{}\n'.format(
        data_train.shape, data_test.shape
    )
)
The shapes of our loaded train-test splits are:

	Train	(134, 53)
	Test	(15, 53)

In [9]:
print(
    'For future reference, here is an overview of the features '\
    'contained in our final TRAINING dataset:\n'
)

data_train.info()

print('\n\n...and, here are the first 3 rows of our TRAINING data:')
data_train.head(3)
For future reference, here is an overview of the features contained in our final TRAINING dataset:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 134 entries, 0 to 133
Data columns (total 53 columns):
Unnamed: 0                    134 non-null int64
PID                           134 non-null int64
Project_Name                  134 non-null object
Description                   134 non-null object
Category                      134 non-null object
Borough                       134 non-null object
Managing_Agency               134 non-null object
Client_Agency                 134 non-null object
Phase_Start                   134 non-null object
Current_Project_Years         134 non-null float64
Current_Project_Year          134 non-null int64
Design_Start                  134 non-null datetime64[ns]
Budget_Start                  134 non-null float64
Schedule_Start                134 non-null datetime64[ns]
Final_Change_Date             134 non-null datetime64[ns]
Final_Change_Years            134 non-null float64
Phase_End                     134 non-null object
Budget_End                    134 non-null float64
Schedule_End                  134 non-null datetime64[ns]
Number_Changes                134 non-null int64
Duration_Start                134 non-null int64
Duration_End                  134 non-null int64
Schedule_Change               134 non-null int64
Budget_Change                 134 non-null float64
Schedule_Change_Ratio         134 non-null float64
Budget_Change_Ratio           134 non-null float64
Budget_Abs_Per_Error          134 non-null float64
Budget_Rel_Per_Error          134 non-null float64
Duration_End_Ratio            134 non-null float64
Budget_End_Ratio              134 non-null float64
Duration_Ratio_Inv            134 non-null float64
Budget_Ratio_Inv              134 non-null float64
Category_Old                  134 non-null object
Bridges                       134 non-null int64
Ferries                       134 non-null int64
Industrial_Development        134 non-null int64
Parks                         134 non-null int64
Sanitation                    134 non-null int64
Schools                       134 non-null int64
Sewers                        134 non-null int64
Streets_and_Roadways          134 non-null int64
Wastewater_Treatment          134 non-null int64
Water_Supply                  134 non-null int64
Category_Code                 134 non-null int64
umap_descr_2D_embed_1         134 non-null float64
umap_descr_2D_embed_2         134 non-null float64
umap_attributes_2D_embed_1    134 non-null float64
umap_attributes_2D_embed_2    134 non-null float64
attribute_clustering_label    134 non-null int64
ae_descr_embed_1              134 non-null float64
ae_descr_embed_2              134 non-null float64
pca_descr_embed_1             134 non-null float64
pca_descr_embed_2             134 non-null float64
dtypes: datetime64[ns](4), float64(21), int64(19), object(9)
memory usage: 55.6+ KB


...and, here are the first 3 rows of our TRAINING data:
Out[9]:
Unnamed: 0 PID Project_Name Description Category Borough Managing_Agency Client_Agency Phase_Start Current_Project_Years ... Category_Code umap_descr_2D_embed_1 umap_descr_2D_embed_2 umap_attributes_2D_embed_1 umap_attributes_2D_embed_2 attribute_clustering_label ae_descr_embed_1 ae_descr_embed_2 pca_descr_embed_1 pca_descr_embed_2
0 0 204 Edgewood Triangle Roadway Reconstruction Reconstruct roadway and extend storm and sanit... Streets and Roadways Queens DDC DOT 2-Design 6.55 ... 7 0.46 -0.41 12.32 10.59 4 6.10 4.54 6.02 3.01
1 1 577 Bronx Public School 19 Addition Design and construction of a new school Schools Bronx SCA DOE 2-Design 2.94 ... 5 -12.40 -3.70 8.39 -1.15 0 0.00 -0.01 -11.32 -2.94
2 2 664 NEW STRM SWR & WM REPLACEMENT IN ACACIA AVE, ETC. New Strom Sewer extension & Water Main replace... Sewers Staten Island DDC DEP 3-Construction Procurement 4.75 ... 6 -0.47 -0.43 16.50 11.85 5 5.16 3.84 4.00 0.82

3 rows × 53 columns

1. About the data

Index

The unabridged notebooks used to generate the findings in this section can be found here on GitHub and here on GitHub.


2. Research Question

Index


3.1. Reference class clustering with K-means and UMAP

Return to section index

IN THIS SECTION:

3.1.1. K-means clustering for reference class labels

3.1.2. HDBSCAN clustering for reference class labels


3.1.1. K-means clustering for reference class labels

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.
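The full K-means workflow lives in the unabridged notebook; the core technique of assigning each project a reference-class label can be sketched as follows (the data here are synthetic stand-ins, whereas the actual notebook clusters scaled project attributes such as Budget_Start and Duration_Start):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Three synthetic groups of "projects" in a 2-D attribute space
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0, 4], scale=0.5, size=(50, 2)),
])

# Scale the attributes, then assign each project a reference-class label
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))
```

In the actual analysis, the number of clusters is chosen with diagnostics such as the gap statistic and silhouette scores rather than fixed up front.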


3.1.2. HDBSCAN clustering for reference class labels

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.

Disadvantages of PCA, Kmeans, DBSCAN

How UMAP works

In [10]:
from dataclasses import dataclass
from pickle import dump, load
import hdbscan  # used by UMAP_embedder.get_clustering (approximate_predict)
import plotly.io as pio
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from IPython.display import Image, SVG
init_notebook_mode()
pio.renderers.keys()
pio.renderers.default = 'jupyterlab'

scaler = load(open("../data/interim/UMAP_fitted_robust_scaler.pkl", "rb"))
clusterer = load(open("../data/interim/UMAP_fitted_HDBSCAN.pkl", "rb"))
mapper_dict = load(open("../data/interim/UMAP_mapper_dict", "rb"))
final_cols = load(open("../data/interim/UMAP_final_cols", "rb"))
embedding_file = "../data/processed/embeddings_uncased_L-2_H-128_A-2.csv"
bert_embedding = pd.read_csv(embedding_file).round(2)
embedding_expanded = bert_embedding["embedding"].str.split(",", expand=True)
# NOTE: every expanded column intentionally shares one label, so a single
# drop(columns="description_embedding_original") in later cells removes them all
embedding_expanded.columns = ["description_embedding_original"] * embedding_expanded.shape[1]
bert_embedding = bert_embedding.join(embedding_expanded)
bert_embedding = bert_embedding.drop_duplicates("PID")
bert_embedding = bert_embedding.drop(columns="embedding").reset_index().drop(columns="index")

class UMAP_embedder():
    def __init__(self):
        #self.initial_columns = columns
        self.initial_columns = ['PID', 'Project_Name', 'Description', 'Category', 'Borough', 'Managing_Agency', 'Client_Agency', 'Phase_Start', 'Current_Project_Years', 'Current_Project_Year', 'Design_Start', 'Budget_Start', 'Schedule_Start', 'Final_Change_Date', 'Final_Change_Years', 'Phase_End', 'Budget_End', 'Schedule_End', 'Number_Changes', 'Duration_Start', 'Duration_End', 'Schedule_Change', 'Budget_Change', 'Schedule_Change_Ratio', 'Budget_Change_Ratio', 'Budget_Abs_Per_Error', 'Budget_Rel_Per_Error', 'Duration_End_Ratio', 'Budget_End_Ratio', 'Duration_Ratio_Inv', 'Budget_Ratio_Inv']
        #self.scale_cols = df_to_transform.columns
        self.scale_cols = ['Current_Project_Years', 'Current_Project_Year', 'Budget_Start',
       'Final_Change_Years', 'Budget_End', 'Number_Changes', 'Duration_Start',
       'Duration_End', 'Schedule_Change', 'Budget_Change',
       'Schedule_Change_Ratio', 'Budget_Change_Ratio', 'Budget_Abs_Per_Error',
       'Budget_Rel_Per_Error', 'Duration_End_Ratio', 'Budget_End_Ratio',
       'Duration_Ratio_Inv', 'Budget_Ratio_Inv']
        self.scaler = scaler
        self.cols_to_dummify = ['Borough', 'Category', 'Client_Agency', 'Managing_Agency',
       'Phase_Start', 'Budget_Start', 'Duration_Start'] 
        #self.cols_to_dummify = columns_before_dummified
        self.final_cols = final_cols
        self.mapper_dict = mapper_dict
        self.clusterer = clusterer
        self.embedding = bert_embedding
        
    def get_mapping_attributes(self,df, return_extra=False, dimensions="all"):
        """
        if return extra = True, returns 3 objects:
            0. mapping
            1. columns needed to be added to harmonize with entire data
            2. dummified df before adding columns of [1]
        """
        raw_df = df[self.initial_columns]
        df_to_transform = df[self.scale_cols]#.drop(columns=["PID"])
        transformed_columns = pd.DataFrame(self.scaler.transform(df_to_transform), columns = df_to_transform.columns)
        scaled_df = (df[df.columns.difference(transformed_columns.columns)]).join(transformed_columns)
        dummified = pd.get_dummies(scaled_df[self.cols_to_dummify])
        
        added_cols = all_cols_all_datasets - set(dummified.columns)
        added_cols = {col: 0 for col in added_cols}
        
        dummified_full = dummified.assign(**added_cols)
        dummified_full = dummified_full[self.final_cols]
        mapping_df_list =[]
        mapper_list = mapper_dict["attributes"].values() if dimensions == "all" else [mapper_dict["attributes"][dimension] for dimension in dimensions]

        for mapper in mapper_list:
            mapping = mapper.transform(dummified_full)
            mapping_df = pd.DataFrame(mapping, columns= [f"umap_attributes_{mapping.shape[1]}D_embed_{col+1}" for col in range(mapping.shape[1])])
            mapping_df_list.append(mapping_df)
            
        final_df = pd.concat(mapping_df_list, axis=1)
        final_df["PID"] = scaled_df["PID"]
        
        if return_extra:
            return final_df, added_cols, scaled_df, dummified
        else:
            return final_df
       
    def get_mapping_description(self, df, dimensions= "all"):
        
        merged = df[["PID"]].merge(self.embedding, on = "PID", how="left").drop(columns="PID")
        mapping_df_list =[merged]
        #mapping_columns = [list(self.embedding.columns.copy())]
        mapper_list = mapper_dict["description"].values() if dimensions == "all" else [mapper_dict["description"][dimension] for dimension in dimensions]
        for mapper in mapper_list:
            mapping = mapper.transform(merged)
            mapping_df = pd.DataFrame(mapping, columns= [f"umap_descr_{mapping.shape[1]}D_embed_{col+1}" for col in range(mapping.shape[1])])
            mapping_df_list.append(mapping_df)
           # mapping_columns += list(mapping_df.columns.copy())
                                   
        final_df = pd.concat(mapping_df_list, axis=1)
        final_df["PID"] = df["PID"].values
        
        return final_df
    
    def get_full_df(self, df, dimensions="all"):
        attribute_df = self.get_mapping_attributes(df, dimensions="all")
        description_df = self.get_mapping_description(df)
        labels, probabilities = self.get_clustering(attribute_df[["umap_attributes_2D_embed_1", "umap_attributes_2D_embed_2"]])
        full_df = description_df.merge(attribute_df, on = "PID", how="left")
        full_df["PID"] = attribute_df["PID"].values
        full_df["attribute_clustering_label"] = labels
        return full_df
    
    def get_clustering(self, attributes_2D_mapping):
        assert attributes_2D_mapping.shape[1] ==2
        new_labels = hdbscan.approximate_predict(clusterer, attributes_2D_mapping)
        return new_labels
        
UMAP_transformer = UMAP_embedder()
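The cell above expands the comma-joined BERT embedding strings into one column per embedding dimension with `str.split(..., expand=True)`; a minimal illustration of that pandas pattern, with made-up values and hypothetical column names:

```python
import pandas as pd

# Toy stand-in for the embedding CSV: one comma-joined string per project
df = pd.DataFrame({
    "PID": [204, 577],
    "embedding": ["0.12,-0.5,1.3", "0.9,0.0,-2.1"],
})

# Expand into one numeric column per embedding dimension
expanded = df["embedding"].str.split(",", expand=True).astype(float)
expanded.columns = [f"embed_{i}" for i in expanded.columns]

# Replace the string column with the expanded numeric columns
df = df.drop(columns="embedding").join(expanded)
print(df)
```

The real cell keeps the original joined column around under a shared label and drops it later, but the split-expand-join shape is the same.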
In [11]:
plot_df = data_allyears.merge(bert_embedding[["PID"]], on = "PID")
filter_cond = plot_df["Budget_Change_Ratio"].replace([-np.inf, np.inf], np.nan).notnull()
color = plot_df["Borough"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(bert_embedding, dimensions=["2D"]).drop(columns="description_embedding_original")

fig = px.scatter(x=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_1"], y=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_2"],
           size=size,
            symbol= symbol,
           color = color, opacity=0.8,
           color_continuous_scale=px.colors.diverging.Portland_r,
            render_mode= "svg",
           labels={"color":color.name, "size": size.name, "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
            title= "2D UMAP projections of all projects description's BERT embeddings.\nColor= 'Borough', Size = 'Budget Change Ratio', Symbol= 'Category'",
            width=1500, height=1000)

fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
SVG(fig.to_image(format= "svg"))
Out[11]:
[Figure: 2D UMAP projection of all projects' description BERT embeddings; color = Borough, symbol = Category, size = log Budget Change Ratio. The legend spans every Borough/Category combination in the data, including non-standard borough entries such as 'not_specified', 'Citywide', 'Upstate', 'Carmel', 'Marlboro', 'Olive', 'Port Jervis', 'Valhalla', and the misspelling 'Manhatten'.]
In [12]:
data_train.head()
Out[12]:
Unnamed: 0 PID Project_Name Description Category Borough Managing_Agency Client_Agency Phase_Start Current_Project_Years ... Category_Code umap_descr_2D_embed_1 umap_descr_2D_embed_2 umap_attributes_2D_embed_1 umap_attributes_2D_embed_2 attribute_clustering_label ae_descr_embed_1 ae_descr_embed_2 pca_descr_embed_1 pca_descr_embed_2
0 0 204 Edgewood Triangle Roadway Reconstruction Reconstruct roadway and extend storm and sanit... Streets and Roadways Queens DDC DOT 2-Design 6.55 ... 7 0.46 -0.41 12.32 10.59 4 6.10 4.54 6.02 3.01
1 1 577 Bronx Public School 19 Addition Design and construction of a new school Schools Bronx SCA DOE 2-Design 2.94 ... 5 -12.40 -3.70 8.39 -1.15 0 0.00 -0.01 -11.32 -2.94
2 2 664 NEW STRM SWR & WM REPLACEMENT IN ACACIA AVE, ETC. New Strom Sewer extension & Water Main replace... Sewers Staten Island DDC DEP 3-Construction Procurement 4.75 ... 6 -0.47 -0.43 16.50 11.85 5 5.16 3.84 4.00 0.82
3 3 402 Green Infrastructure Construction Design and construction of right-of-way green ... Sewers Citywide EDC DEP 2-Design 6.42 ... 6 12.30 -1.08 -1.78 15.23 1 5.83 4.28 5.14 -5.36
4 4 760 Construction Of Sanitary Sewers, Storm Sewers (Bm Construction of storm sewers, sanitary sewers,... Sewers not_specified DDC DEP 2-Design 3.01 ... 6 -1.27 -0.19 17.35 11.24 5 4.66 3.46 3.57 -1.23

5 rows × 53 columns

In [13]:
plot_df = data_train
filter_cond = data_train["PID"].notnull()
color = plot_df["Schedule_Change_Ratio"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(bert_embedding, dimensions=["2D"]).drop(columns="description_embedding_original")

fig = px.scatter(x=data_train[filter_cond]["umap_attributes_2D_embed_1"], y=data_train[filter_cond]["umap_attributes_2D_embed_2"],
           size=size,
            symbol= symbol,
           color = color, opacity=0.8,
           color_continuous_scale=px.colors.diverging.Portland_r,
            render_mode= "svg",
           labels={"color":color.name, "size": f"Log {size.name}", "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
            title= "2D UMAP projections of train set projects attributes.\nColor= 'Borough', Size = 'Budget Change Ratio', Symbol= 'Category'",
            width=1000, height=800)

fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_layout(legend_orientation="h")
SVG(fig.to_image(format= "svg"))
Out[13]:
[Figure: 2D UMAP projection of training-set project attributes; symbol = Category (11 categories), size = log Budget Change Ratio, color scale = Schedule_Change_Ratio ranging 0 to 1.4.]
In [14]:
plot_df = data_test
filter_cond = data_test["PID"].notnull()  # filter on the test frame itself
color = plot_df["Schedule_Change_Ratio"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(bert_embedding, dimensions=["2D"]).drop(columns="description_embedding_original")

fig = px.scatter(x=plot_df[filter_cond]["umap_attributes_2D_embed_1"], y=plot_df[filter_cond]["umap_attributes_2D_embed_2"],
           size=size,
            symbol= symbol,
           color = color, opacity=0.8,
           color_continuous_scale=px.colors.diverging.Portland_r,
            render_mode= "svg",
           labels={"color":color.name, "size": f"Log {size.name}", "symbol":symbol.name, "x": "UMAP component 1", "y": "UMAP component 2"},
            title= "2D UMAP projections of train set projects attributes.\nColor= 'Borough', Size = 'Budget Change Ratio', Symbol= 'Category'",
            width=1000, height=800)

fig.update_layout({
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_layout(legend_orientation="h")
SVG(fig.to_image(format= "svg"))
Out[14]:
[Figure: 2D UMAP projection of test-set project attributes; symbol = Category (6 categories present), size = log Budget Change Ratio, color scale = Schedule_Change_Ratio ranging 0 to 1.]

Say something about imbalance of categories/ Boroughs in train/test

In [15]:
set(data_train.Category).symmetric_difference(set(data_test.Category))
Out[15]:
{'Bridges',
 'Industrial Development',
 'Other Govt Facilities and Improvements',
 'Parks',
 'Schools'}
In [16]:
set(data_train.Borough).symmetric_difference(set(data_test.Borough))
Out[16]:
{'Brooklyn',
 'Brooklyn, Queens',
 'Manhattan',
 'Manhattan, Bronx',
 'Manhattan, Staten Island, Bronx',
 'Marlboro',
 'New York',
 'Olive',
 'Port Jervis',
 'Valhalla'}
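The two cells above use `set.symmetric_difference` to surface levels that appear in only one of the two splits; a minimal illustration of the check, with made-up category sets:

```python
# Hypothetical category sets for the two splits
train_categories = {"Sewers", "Bridges", "Schools", "Ferries"}
test_categories = {"Sewers", "Ferries", "Water Supply"}

# Levels present in exactly one split: a model fit on train never sees
# "Water Supply", and the test set never exercises "Bridges" or "Schools"
only_in_one = train_categories.symmetric_difference(test_categories)
print(sorted(only_in_one))  # → ['Bridges', 'Schools', 'Water Supply']
```

A non-empty result flags an imbalance risk: test-set projects from unseen categories (or boroughs) force the model to extrapolate.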

Clustering UMAP projections

In [17]:
plot_df = data_train
filter_cond = data_train["PID"].notnull()
color = plot_df["attribute_clustering_label"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(
    bert_embedding, dimensions=["2D"]
).drop(columns="description_embedding_original")

fig = px.scatter(
    x=plot_df[filter_cond]["umap_attributes_2D_embed_1"],
    y=plot_df[filter_cond]["umap_attributes_2D_embed_2"],
    size=size,
    symbol=symbol,
    color=color,
    opacity=0.8,
    color_continuous_scale=px.colors.diverging.Portland_r,
    render_mode="svg",
    labels={"color": color.name, "size": f"Log {size.name}",
            "symbol": symbol.name, "x": "UMAP component 1",
            "y": "UMAP component 2"},
    title="2D UMAP projections of train set project attributes.\nColor= 'HDBSCAN clustering Label', Size = 'Budget Change Ratio', Symbol= 'Category'",
    width=1000, height=800,
)

fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
fig.update_layout(legend_orientation="h")
SVG(fig.to_image(format="svg"))
Out[17]:
[Figure: 2D UMAP projection of train-set project attributes (UMAP component 1 vs. 2), colored by HDBSCAN clustering label, sized by log Budget Change Ratio, symbol by Category]

Cluster-validation diagnostics to explore: gap statistic, silhouette scores, and tree-based separability
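As a sanity check on the clustering labels, the silhouette diagnostic named above can be sketched with scikit-learn. This is a minimal sketch on synthetic blobs standing in for our 2-D UMAP embedding columns; the `make_blobs` data and the 2-6 cluster range are illustrative assumptions, not our project data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic 2-D points standing in for the UMAP embedding columns
X, _ = make_blobs(n_samples=300, centers=4, random_state=109)

# score candidate cluster counts; higher silhouette = better separation
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=109).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

The gap statistic follows the same pattern, except that within-cluster dispersion is compared against a null reference distribution rather than summarized per point.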

Understand clustering labels

In [70]:
from math import pi

def make_spider(mean_peaks_per_cluster, row, title, color):
    # number of variables (every column except 'group')
    categories = list(mean_peaks_per_cluster)[1:]
    N = len(categories)

    # angle of each axis in the plot (divide the circle by the number of variables)
    angles = [n / float(N) * 2 * pi for n in range(N)]
    angles += angles[:1]

    # initialize the spider subplot
    ax = plt.subplot(4, 4, row + 1, polar=True)

    # place the first axis on top and draw clockwise
    ax.set_theta_offset(pi / 2)
    ax.set_theta_direction(-1)

    # draw one axis per variable and add its label
    plt.xticks(angles[:-1], categories, color='grey', size=8)

    # draw y-axis labels
    ax.set_rlabel_position(0)

    # plot the mean values for this cluster and fill the enclosed area
    values = mean_peaks_per_cluster.loc[row].drop('group').values.flatten().tolist()
    values += values[:1]
    ax.plot(angles, values, color=color, linewidth=2, linestyle='solid')
    ax.fill(angles, values, color=color, alpha=0.4)

    # add a title
    plt.title(title, size=11, color=color, y=1.1)


# ------- PART 2: apply to all clusters
# initialize the figure
my_dpi = 50
plt.figure(figsize=(1000 / my_dpi, 1000 / my_dpi), dpi=my_dpi + 40)

# create a color palette, one color per cluster
my_palette = plt.cm.get_cmap("Set2", len(radar_plots_df.index))

# loop to plot one spider chart per cluster
for row in range(len(radar_plots_df.index)):
    make_spider(
        radar_plots_df, row=row,
        title='group ' + radar_plots_df['group'][row].astype("str"),
        color=my_palette(row),
    )

plt.tight_layout()

These interactive plotly histograms will be converted to static images in the final report.

In [67]:
px.histogram(test, x="Budget_metric_value", color="label", facet_row="Budget_metric", height=2000)
In [68]:
px.histogram(test, x="Schedule_metric_value", color="label", facet_row="Schedule_metric", height=2000)

3.2. Embedding project description text with BERT

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.

In [ ]:
 
In [ ]:
 

3.3.1. PCA dimension-reduced encoding of BERT-embedded text

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.
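The general pattern of this reduction can be sketched with scikit-learn. The random 768-dimensional matrix below is a stand-in for the actual BERT description embeddings produced in the supporting notebook, not our project data:

```python
import numpy as np
from sklearn.decomposition import PCA

# random 768-dim vectors standing in for BERT description embeddings
rng = np.random.default_rng(109)
embeddings = rng.normal(size=(200, 768))

# project down to the 2 directions of greatest variance; the resulting
# columns play the role of pca_descr_embed_1 / pca_descr_embed_2
pca = PCA(n_components=2)
descr_2d = pca.fit_transform(embeddings)
```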

In [ ]:
 

3.3.2. Autoencoder dimension-reduced encoding of BERT-embedded text

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.
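One minimal way to sketch the idea of a 2-D autoencoder encoding, assuming a simple linear bottleneck network trained to reconstruct its own input (our actual implementation lives in the supporting notebook; the random input matrix and network sizes here are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# random vectors standing in for BERT description embeddings
rng = np.random.default_rng(109)
embeddings = rng.normal(size=(200, 32))

# train a network to reconstruct its own input through a 2-unit bottleneck;
# with an identity activation this is a linear autoencoder (akin to PCA)
ae = MLPRegressor(hidden_layer_sizes=(2,), activation="identity",
                  max_iter=2000, random_state=109)
ae.fit(embeddings, embeddings)

# the bottleneck activations are the 2-D encoding (manual forward pass
# through the first layer); columns play the role of ae_descr_embed_1/2
encoded = embeddings @ ae.coefs_[0] + ae.intercepts_[0]
```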

In [ ]:
 

3.3.3. UMAP dimension-reduced encoding of BERT-embedded text

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.

In [ ]:
plot_df = data_allyears.merge(bert_embedding[["PID"]], on="PID")
filter_cond = plot_df["Budget_Change_Ratio"].replace([-np.inf, np.inf], np.nan).notnull()
color = plot_df["Borough"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = np.log(plot_df["Budget_Change_Ratio"][filter_cond].abs() + 1)
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(
    bert_embedding, dimensions=["2D"]
).drop(columns="description_embedding_original")

fig = px.scatter(
    x=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_1"],
    y=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_2"],
    size=size,
    symbol=symbol,
    color=color,
    opacity=0.8,
    color_continuous_scale=px.colors.diverging.Portland_r,
    render_mode="svg",
    labels={"color": color.name, "size": size.name, "symbol": symbol.name,
            "x": "UMAP component 1", "y": "UMAP component 2"},
    title="2D UMAP projections of all project descriptions' BERT embeddings.\nColor= 'Borough', Size = 'Budget Change Ratio', Symbol= 'Category'",
    width=1500, height=1000,
)

fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
SVG(fig.to_image(format="svg"))
In [ ]:
plot_df = data_allyears.merge(bert_embedding[["PID"]], on="PID")
filter_cond = plot_df["Budget_Change_Ratio"].replace([-np.inf, np.inf], np.nan).notnull()
color = plot_df["Borough"][filter_cond]
symbol = plot_df["Category"][filter_cond]
size = plot_df["Schedule_Change_Ratio"][filter_cond].abs()
UMAP_description_all_projects = UMAP_transformer.get_mapping_description(
    bert_embedding, dimensions=["2D"]
).drop(columns="description_embedding_original")

fig = px.scatter(
    x=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_1"],
    y=UMAP_description_all_projects[filter_cond]["umap_descr_2D_embed_2"],
    size=size,
    symbol=symbol,
    color=color,
    opacity=0.8,
    color_continuous_scale=px.colors.diverging.Portland_r,
    render_mode="svg",
    labels={"color": color.name, "size": size.name, "symbol": symbol.name,
            "x": "UMAP component 1", "y": "UMAP component 2"},
    title="2D UMAP projections of all project descriptions' BERT embeddings.\nColor= 'Borough', Size = 'Schedule Change Ratio', Symbol= 'Category'",
    width=1500, height=1000,
)

fig.update_layout({
    'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})
SVG(fig.to_image(format="svg"))
In [ ]:
 

Subset and scale data for X and y frames for all future models

In [ ]:
X_cols = [
    'Budget_Start',
    'Duration_Start',
    'Bridges',
    'Ferries',
    'Industrial_Development',
    'Parks',
    'Sanitation',
    'Schools',
    'Sewers',
    'Streets_and_Roadways',
    'Wastewater_Treatment',
    'Water_Supply',
    'Category_Code',
    'umap_descr_2D_embed_1',
    'umap_descr_2D_embed_2',
    'umap_attributes_2D_embed_1',
    'umap_attributes_2D_embed_2',
    'attribute_clustering_label',
    'ae_descr_embed_1',
    'ae_descr_embed_2',
    'pca_descr_embed_1',
    'pca_descr_embed_2',
    'attribute_km3_label'
]

y_cols = [
    'Budget_Change_Ratio',
    'Schedule_Change_Ratio'
]

X_train, y_train = data_train[X_cols], data_train[y_cols]
X_test, y_test = data_test[X_cols], data_test[y_cols]
In [ ]:
print('{}\t{}'.format(X_train.shape, X_test.shape))
print('{}\t{}'.format(y_train.shape, y_test.shape))
In [ ]:
X_train.info()
print()
y_train.info()
X_train.describe()
In [ ]:
#######################################
# CREATE SCALED DATAFRAMES
#######################################

# Exclude all columns except the first two continuous predictors
# (Budget_Start, Duration_Start) from scaling; the remaining columns
# are binary, categorical, or pre-scaled embedding features
exclude_scale_cols = list(X_train)[2:]


# Standardize both X_train and X_test data, fitting X_train as the
# scaler for both
scaler = StandardScaler
scale_before_func = None
scale_after_func = None
reapply_scaler = False


X_train_std, Scaler_std = scale_features(
    X_train, X_train,
    exclude_scale_cols,
    scaler,
    scale_before_func,
    scale_after_func,
    reapply_scaler,
)

X_test_std, _ = scale_features(
    X_train, X_test,
    exclude_scale_cols,
    scaler,
    scale_before_func,
    scale_after_func,
    reapply_scaler,
)


# Standardize X_train and X_test, pass through sigmoid transformation
# and re-standardize to minimize skew of data
scaler = StandardScaler
scale_before_func = None
scale_after_func = sigmoid
reapply_scaler = True


X_train_std_sig, Scaler_std_sig = scale_features(
    X_train, X_train,
    exclude_scale_cols,
    scaler,
    scale_before_func,
    scale_after_func,
    reapply_scaler,
)

X_test_std_sig, _ = scale_features(
    X_train, X_test,
    exclude_scale_cols,
    scaler,
    scale_before_func,
    scale_after_func,
    reapply_scaler,
)
In [ ]:
# inspect scaled datasets
plot_bdgt_sched_scaled(X_train, X_train_std, 'Standardized', X_test, X_test_std)
plot_bdgt_sched_scaled(X_train, X_train_std_sig, 'Sigmoid standardized', X_test, X_test_std_sig)

FINDINGS:

By visualizing our Budget_Start and Duration_Start predictors above, we can see heavy skew and clear outliers in the original unscaled data. Applying standardization to these predictors, as illustrated in the upper right-hand plot, puts both variables on the same scale. However, standardization alone neither alleviates the skewness of our data nor tames the outlying data points.

Therefore, we have also applied a sigmoid transformation to the standardized data and re-standardized the result, as shown in the lower right-hand plot. This sigmoid transformation alleviates the skew and distributes our data points more evenly, drawing outliers far closer to the center of the distribution.
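The standardize, sigmoid-squash, re-standardize recipe described above can be sketched in isolation. The lognormal sample below is an illustrative stand-in for Budget_Start, and the `sigmoid` helper is assumed to match the function passed to `scale_features` as `scale_after_func`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# heavily right-skewed sample standing in for Budget_Start
rng = np.random.default_rng(109)
x = rng.lognormal(mean=10, sigma=2, size=(500, 1))

z = StandardScaler().fit_transform(x)          # standardize
squashed = sigmoid(z)                          # squash outliers into (0, 1)
z2 = StandardScaler().fit_transform(squashed)  # re-standardize
```

Because the sigmoid is bounded, extreme standardized values are compressed toward the rest of the distribution before the second standardization.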

Now we will fit a "Baseline" linear regression model on our scaled datasets to see which performs best.

4.1 Baseline Linear Regression

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.

Fit "Baseline" linear regression model using only Budget_Start and Duration_Start predictors

In [ ]:
features = [
    'Budget_Start',
    'Duration_Start'
]

print(
    '\nThese 2 "BASELINE" models used the following predictors:\n\n\t{}\n\n'\
    ''.format(features)
)


sm_formulas = [
    ' + '.join(features),
    ' + '.join(features)
]

model_descr = 'Baseline linear regression, standardized data, 2 predictors'

model_LR2 = generate_model_dict(
    sm.ols,
    model_descr,
    X_train_std[features], X_test_std[features], y_train, y_test,
    multioutput=True,
    verbose=False,
    predictions=True,
    scores=True,
    model_api='statsmodels',
    sm_formulas=sm_formulas,
)

print_model_results(model_LR2)

model_descr = 'Baseline linear regression, sigmoid standardized data, 2 predictors'

model_LR2_sig = generate_model_dict(
    sm.ols,
    model_descr,
    X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
    multioutput=True,
    verbose=False,
    predictions=True,
    scores=True,
    model_api='statsmodels',
    sm_formulas=sm_formulas,
)


print_model_results(model_LR2_sig)

FINDINGS:

By using sigmoid-scaled data for both our Budget_Start and Duration_Start predictors, and thereby reducing their skew, we achieve slightly better performance in predicting both Budget_Change_Ratio and Schedule_Change_Ratio on both the train and test sets.

While these are just true Baseline models in which we use linear regression with only 2 predictors, the results indicate that:

  1. we may benefit from the use of sigmoid-scaled data for those two baseline predictors, and
  2. Budget_Change_Ratio may prove more difficult to predict than Schedule_Change_Ratio: our Budget_Change_Ratio predictions perform worse than a naive model, as indicated by the negative $R^2$ score on the test data.
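For reference, $R^2$ is computed relative to a naive model that always predicts the mean of the observed outcomes, which is why scores below zero are possible; a small worked check with scikit-learn's `r2_score` (illustrative numbers, not our project data):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# naive model: always predict the mean of y_true -> R^2 of exactly 0
r2_naive = r2_score(y_true, np.full(4, y_true.mean()))

# predictions worse than the mean yield a negative R^2
# (SS_res = 20 vs. SS_tot = 5, so R^2 = 1 - 20/5 = -3)
r2_bad = r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0]))
```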

Fit "Baseline+" regression model, incorporating project Category as a predictor

In [ ]:
features = list(X_train)[:-10]

print(
    '\nThis "BASELINE+" model uses the project Category '\
    'as one-hot-encoded predictors:'\
    '\n\n\t{}\n\n'.format(features)
)


sm_formulas = [
    ' + '.join(features),
    ' + '.join(features)
]

model_descr = 'Baseline linear regression, sigmoid standardized data, with categories'

model_LR3 = generate_model_dict(
    sm.ols,
    model_descr,
    X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
    multioutput=True,
    verbose=False,
    predictions=True,
    scores=True,
    model_api='statsmodels',
    sm_formulas=sm_formulas,
)


print_model_results(model_LR3)

FINDINGS:

By adding the one-hot-encoded Category feature to our model, which provides information about the type of project for each observation, we can see a clear improvement in our Schedule_Change_Ratio $R^2$ scores for both train and test predictions.

Our Budget_Change_Ratio results, on the other hand, have degraded in performance compared to our simpler Baseline model with just 2 predictors. Regardless, neither the Baseline nor the Baseline+ linear regression model is able to predict the 3-year Budget_Change_Ratio better than the naive model, as indicated by these $R^2$ results.

This indicates to us that we will likely have more difficulty in predicting Budget_Change_Ratio in our future models and that a Linear Regression model likely lacks the expressiveness required to adequately fit a model to the underlying relationship between predictors and outcome variable.

As one last step before moving on from linear regression, we will quickly inspect the predictions made by our Baseline+ model, as well as its regression coefficients.

Visualize Baseline+ predictions and coefficients

In [ ]:
plot_true_pred(model_dict=model_LR3, dataset='train')
plot_true_pred(model_dict=model_LR3, dataset='test')

ADD PLOT OF COEFFICIENTS & INTERPRETATION

4.2 Smoothing Spline Generalized Additive Models (GAMs)

Return to section index

IN THIS SECTION:

4.2.1. Smoothing Spline GAMs with Baseline Predictors

4.2.2. Smoothing Spline GAMs with Engineered Features

4.2.1. Smoothing Spline GAMs with Baseline Predictors

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.

In [ ]:
features = [
    'Budget_Start',
    'Duration_Start',
    'Category_Code',
]

print(
    '\nThis smoothing spline GAM uses the same predictors as our '\
    '"BASELINE+" regression model,\nexcept Category is label-encoded '\
    'instead of one-hot-encoded:'\
    '\n\n\t{}\n\n'.format(features)
)


model_descr = 'Smoothing spline GAM, sigmoid standardized data, with categories'

terms = s(0) + s(1) + f(2)

model_GAM0 = generate_model_dict(
    LinearGAM,
    model_descr,
    X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
    multioutput=False,
    verbose=False,
    predictions=True,
    scores=True,
    model_api='sklearn',
    terms=terms,
)


print_model_results(model_GAM0)

FINDINGS:

Here we can see that we stand to benefit from the added expressiveness of the smoothing spline class of linear models, as illustrated by the improved $R^2$ results shown above. Our Schedule_Change_Ratio predictions have improved from $R^2=0.34$ to $R^2=0.58$; however, our Budget_Change_Ratio test $R^2$ score is still extremely negative. The smoothing spline GAM does fit the training set with a Budget_Change_Ratio $R^2$ score of 0.54, which is somewhat promising. The challenge now is to improve on this with a model that generalizes to unseen data.

Use gridsearch to find the optimal $\lambda$ value for each term in the smoothing spline GAM model

  • Here we treat each y output independently, partly because pyGAM's LinearGAM will not fit a multi-output model, but mostly because each output behaves differently and we have found that different $\lambda$ values are required to optimize each one.
  • We also use pyGAM's native gridsearch method to choose our $\lambda$ values.
    • Because there are so few instances of some project categories, traditional k-fold cross-validation creates splits in which some training folds are missing one or two categories.
    • When that occurs, pyGAM cannot fit a coefficient for the missing category and raises an error.
In [ ]:
terms = s(0) + s(1) + f(2)

# generate list of lambdas against which to perform gridsearch for
# each outcome variable and each of our 3 input predictors
lam_list = np.logspace(-3, 5, 10)
lams = [lam_list] * 3

# fit GAM to predict budget change ratio, using gridsearch to find the
# optimal lambda penalty for each term (gridsearch fits the model
# internally, so no separate call to .fit() is needed)
gam1 = LinearGAM(terms).gridsearch(
    X_train_std_sig[features], y_train['Budget_Change_Ratio'],
    lam=lams
)

# fit GAM to predict schedule change ratio
gam2 = LinearGAM(terms).gridsearch(
    X_train_std_sig[features], y_train['Schedule_Change_Ratio'],
    lam=lams
)
In [ ]:
%%capture --no-stdout

print(
    '\nGAM gridsearch results for BUDGET_CHANGE_RATIO prediction model:\n'
)
gam1.summary()

print(
    '\n\n\nGAM gridsearch results for SCHEDULE_CHANGE_RATIO prediction model:\n'
)
gam2.summary()

ADD INTERPRETATION

In [ ]:
print(
    '\nThese smoothing spline GAMs have been fit using the optimal lambda penalties '\
    'for each term, one set of results show models optimized for BUDGET_CHANGE_RATIO '\
    'predictions, the other for SCHEDULE_CHANGE_RATIO predictions.\n\n'\
    'The predictors used are:\n\n\t{}\n\n'.format(features)
)


model_descr = 'Smoothing spline GAM: sigmoid scaled, BUDGET_CHANGE_RATIO optimal penalties'

terms = s(0, lam=0.001) + s(1, lam=100000) + f(2, lam=215.4435)

model_GAM1 = generate_model_dict(
    LinearGAM,
    model_descr,
    X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
    multioutput=False,
    verbose=False,
    predictions=True,
    scores=True,
    model_api='sklearn',
    terms=terms,
)

print_model_results(model_GAM1)

model_descr = 'Smoothing spline GAM: sigmoid scaled, SCHEDULE_CHANGE_RATIO optimal penalties'

terms = s(0, lam=100000) + s(1, lam=27.8256) + f(2, lam=3.5938)

model_GAM2 = generate_model_dict(
    LinearGAM,
    model_descr,
    X_train_std_sig[features], X_test_std_sig[features], y_train, y_test,
    multioutput=False,
    verbose=False,
    predictions=True,
    scores=True,
    model_api='sklearn',
    terms=terms,
)


print_model_results(model_GAM2)

ADD INTERPRETATION

Visualize optimized baseline smoothing spline GAM predictions and contribution by predictor

In [ ]:
y_pred_train = np.hstack(
    [
        model_GAM1['predictions']['train'][:, 0].reshape(-1,1),
        model_GAM2['predictions']['train'][:, 1].reshape(-1,1)
    ]
)

y_pred_test = np.hstack(
    [
        model_GAM1['predictions']['test'][:, 0].reshape(-1,1),
        model_GAM2['predictions']['test'][:, 1].reshape(-1,1)
    ]
)
In [ ]:
model_descr = r"Smoothing spline GAM: standardized sigmoid scaled data and optimized $\lambda$'s"
y1_label = 'Budget Change Ratio'
y2_label = 'Schedule Change Ratio'

plot_true_pred(
    dataset='train',
    y_true=y_train,
    y_pred=y_pred_train,
    y1_label=y1_label,
    y2_label=y2_label,
    model_descr=model_descr
)
plot_true_pred(
    dataset='test',
    y_true=y_test,
    y_pred=y_pred_test,
    y1_label=y1_label,
    y2_label=y2_label,
    model_descr=model_descr
)

ADD INTERPRETATION

In [ ]:
plot_gam_by_predictor(
    model_dict=model_GAM1, model_index=0,
    X_data=X_train_std_sig[features], y_data=y_train,
    dataset='train'
)

print()
plot_gam_by_predictor(
    model_dict=model_GAM2, model_index=1,
    X_data=X_train_std_sig[features], y_data=y_train,
    dataset='train'
)

ADD INTERPRETATION & CONCLUSIONS

4.2.2. Smoothing Spline GAMs with Engineered Features

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.

In [ ]:
 

5. Nonparametric Models

Index

IN THIS SECTION:

5.1. Decision tree regressors

5.2. Ensemble tree regressors with boosting

In [ ]:
 

5.1. Decision Tree Regressors

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.
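A minimal sketch of the decision tree regressor workflow this section explores, on synthetic data standing in for our train/test frames (the data and `max_depth=4` are illustrative assumptions, not our tuned configuration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# synthetic stand-in for the (X, y) frames built above, with a
# nonlinear signal in the first predictor
rng = np.random.default_rng(109)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=109)

# max_depth limits overfitting; in practice it is worth tuning by
# cross-validation
tree = DecisionTreeRegressor(max_depth=4, random_state=109).fit(X_tr, y_tr)
train_r2, test_r2 = tree.score(X_tr, y_tr), tree.score(X_te, y_te)
```

Unlike the linear and GAM models above, the tree needs no feature scaling, since splits depend only on feature orderings.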

In [ ]:
 

5.2. Ensemble Tree Regressors with Boosting

Return to section index

The unabridged notebook used to generate the findings in this section can be found here on GitHub.
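A boosted ensemble fits many shallow trees sequentially, each on the residuals of the ensemble so far. A minimal sketch with scikit-learn's `GradientBoostingRegressor` on the same kind of synthetic stand-in data (all hyperparameters are illustrative, not our tuned values):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# synthetic stand-in data with a nonlinear signal in the first predictor
rng = np.random.default_rng(109)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=109)

# many shallow trees fit sequentially on residuals; learning_rate
# trades off against n_estimators
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05,
                                max_depth=2, random_state=109).fit(X_tr, y_tr)
test_r2 = gbr.score(X_te, y_te)
```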

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

6. Conclusions

Index

In [ ]:
 

7. Next steps & future work

Index

In [ ]:
import glob
import pickle

import shap

shap.initjs()
In [ ]:
# load each pickled model saved by the earlier notebooks
# (assumes the models were saved as .pkl files in this directory)
model_dict = {}
for file in glob.glob("../data/interim/pickled_models/*.pkl"):
    with open(file, "rb") as f:
        model_dict[file] = pickle.load(f)

8. Sources & References

Index

In [ ]:
 
In [ ]:
 

9. Appendix

Index

IN THIS SECTION:

9.1. Dictionary of features in final dataset

In [ ]:
 

9.1. Data Dictionary of Features In Final Dataset

Return to section index

The .csv file used to generate this data dictionary can be found here on GitHub.

In [ ]:
 
In [ ]: